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Abstract. Recent experimental advances in biology allow researchers to 
obtain gene expression profiles at single-cell resolution over hundreds, or 
even thousands of cells at once. These single-cell measurements provide 
snapshots of the states of the cells that make up a tissue, instead of the 
population-level averages provided by conventional high-throughput ex¬ 
periments. This new data therefore provides an exciting opportunity for 
computational modelling. In this paper we introduce the idea of viewing 
single-cell gene expression profiles as states of an asynchronous Boolean 
network, and frame model inference as the problem of reconstructing 
a Boolean network from its state space. We then give a scalable algo¬ 
rithm to solve this synthesis problem. We apply our technique to both 
simulated and real data. We first apply our technique to data simulated 
from a well established model of common myeloid progenitor differentia¬ 
tion. We show that our technique is able to recover the original Boolean 
network rules. We then apply our technique to a large dataset taken dur¬ 
ing embryonic development containing thousands of cell measurements. 
Our technique synthesises matching Boolean networks, and analysis of 
these models yields new predictions about blood development which our 
experimental collaborators were able to verify. 


1 Introduction 


As biological data becomes more accurate and becomes available in larger vol¬ 
umes, researchers are increasingly adopting concepts from computer science to 
the modelling and analysis of living systems. Formal methods have been success¬ 
fully applied to gain insights into biological processes and to direct the design 
of new experiments [3j-[5 12 . New single-cell resolution gene expression mea¬ 
surement technology provides an exciting opportunity for modelling biological 
systems at the cellular level. Single-cell gene expression profiles provide a snap¬ 
shot of the true states that cells can reach in the real experimental system, a 
level of detail which has not been available before 15,18 . A major challenge for 


researchers is to move beyond established methods for the analysis of population 
data, to new techniques that take advantage of single-cell resolution data [li]. 

Uncovering and understanding the gene regulatory networks (GRNs) which 
underlie the behaviour of stem and progenitor cells is a central issue in molecular 







cell biology. These GRNs control the self-renewal and differentiation capabilities 
of the stem cells that maintain adult tissues, and become perturbed in diseases 
such as cancer. They also specify the complex developmental processes that lead 
to the initial formation of tissues in the embryo. Understanding how to effectively 
control GRNs can lead to important insights for the programmed generation of 
clinically-relevant cell types important for regenerative medicine, as well as into 
the design of molecular therapies to target cancerous cells. 


Biological systems can be modelled at different levels of abstraction. At a 
molecular level, the biochemical events which occur inside a cell can be cap¬ 
tured by stochastic processes, given by chemical master equations 24 . These 


chemical events are fundamentally stochastic, driven by random fluctuations of 
molecules present at low concentrations and by Brownian motion. Asynchronous 
Boolean networks abstract away details of transcription, translation and molec¬ 
ular binding reactions and represent the status of each modelled substance as 
either active (on) or inactive (off), while using non-determinism to capture dif¬ 
ferent options that arise from stochastic behaviour 7,13 27 . In the cell, gene 
activity is controlled by combinatorial logic in which proteins called transcrip¬ 
tion factors cooperate to physically bind to a regulatory DNA region of a gene 
and trigger (or inhibit) its transcription. Target genes may in turn code for tran¬ 
scription factors, forming a complex GRN. Asynchronous Boolean networks are 
particularly well suited to modelling GRNs because the combinatorial logic reg¬ 
ulating gene activity can be expressed as a Boolean function. For example, gene 
X may be activated by either the presence of gene A or by the presence of both 
genes B and C. The presence of a repressor D may prevent X from becoming 
triggered by the presence of these activating genes. When modelling the differ¬ 
entiation of a cell using an asynchronous Boolean network, dynamics proceed by 
a series of single-gene changes. Mature, differentiated cell types correspond to 
stable attractor states of the model. 


Predictions about the modes of interaction between genes resulting from 
computational analysis can be tested experimentally through a range of assays. 
For example, if analysis of a model predicts that gene X is activated by gene A, 
a ChIP (Chromatin ImmunoPrecipitation) assay can be used to assess whether 
the protein coded for by A binds to a regulatory region of X. Then, perturbations 
which prevent the binding of A to this region can be introduced, and the effect 
that this has on the expression of X can be examined. 


State-space analyses of hand-built asynchronous Boolean network models 
based on literature-derived gene regulatory interactions have been successfully 
applied to model cell fate decisions, and to reproduce known experimental results 
(e.g„ HUP] ). Here we address the problem of automatically constructing such 
models directly from data. If we think of single-cell gene expression profiles as the 
state space of an asynchronous Boolean network, can we identify the underlying 
gene regulatory logic that could have generated this data? 


We encode the matching of an asynchronous Boolean network to a state 
space as a synthesis problem and use constraint (satisfiability) solving techniques 
for answering the synthesis problem. The synthesised network has to match 
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Fig. 1. Single-cell gene expression measurements for two genes, in 3934 cells. 


the data in two aspects. First, the resulting network should try to minimise 
transitions to expression points that are not part of the sampled data. Second, 
the resulting network should allow for a progression through the state space 
in a way that matches the flow of time through the different experiments that 
produced the data. A direct encoding of this problem into a satisfiability problem 
does not scale well. We suggest a modular search that handles parts of the state 
space and the network and does not need to reason about the entire network 
at once. We consider two test cases. First, we try to reconstruct an existing 
asynchronous Boolean network from its state space. We are able to reconstruct 
Boolean rules from the original network. Second, we apply our technique to 
experimental data derived from blood cell development. The network that is 
produced by our technique matches known dependencies and suggests interesting 
novel predictions. Some of these predictions were validated by our collaborators. 

This paper describes the algorithm that we used to obtain the results in a 
recently published biological paper on a single-cell resolution study of embryonic 
blood development 16 . The biological paper includes full details of the exper¬ 


iment that generated the data, and the biological validation of our resulting 
synthesised model. Here, we cover the algorithmic aspects of our method. 


2 Biological Motivation 

Single-cell gene expression experiments produce gene expression profiles for indi¬ 
vidually measured cells. Each of these gene expression profiles is a vector where 
each element gives the level of expression of one gene in that cell. Figure [T] plots 
the level of the genes Etv2 and Runxi over 3934 cells. 

Our experimental collaborators performed such gene expression profiling on 
five batches of cells taken from four sequential developmental time points of 
a mouse embryo. For each time point, the experiment aimed to capture every 
cell with the potential to develop into a blood cell, providing a comprehensive 
single-cell resolution picture of the developmental timecourse of blood develop¬ 
ment. This resulted in a data set of 3934 cell measurements. Full details of this 
experiment and our analysis can be found in 116 . This data set is the first of its 
kind, attempting to capture an entire tissue’s worth of progenitor cells across a 
developmental time course. This level of coverage of the potential cell state space 
is required for our approach to accurately recover gene regulatory networks, and 
requires the measurement of thousands of cell profiles. Later we will introduce a 
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Fig. 2. State graph. Node colours correspond to the time point at which a state was 
measured. States from the earliest of the time points are coloured blue, and states from 
the last time point are coloured red. 


synthetic data set of a few hundred cell states in order to illustrate how our ap¬ 
proach works, but we would like to stress that to be usable on real experimental 
data our algorithm needs to be able to scale thousands of cell states. 

For each of 3934 cells, the level of expression of 33 transcription factor genes 
was measured. Expression levels are non-negative real numbers, where the value 
0 indicates that the given gene is unexpressed in the cell (see Figure [I]). 

The key idea introduced in this paper is to view this gene-expression data as 
a sample from the state-space of an asynchronous Boolean network. In the past, 
manually curated Boolean networks have been successfully used to recapitulate 
experimental results [2 if 13 . Such Boolean networks were hand-constructed 
from biological knowledge that has accumulated in the literature over many 
years. Here, we aim to produce such Boolean networks automatically, directly 
from gene expression data, by employing synthesis techniques. We aim to pro¬ 
duce a Boolean network that can explain the data and can be used to inform 
biological experiments for uncovering the nature of gene regulatory networks in 
real biological systems. 

In order to convert the data into a format that can be viewed as a Boolean 
network state space, we first discretise expression values to binary, assigning the 
value 1 to all non-zero gene expression measurements. A value of zero corre¬ 
sponds to the discovery threshold of the equipment used to produce the data. 
Discretising the 3934 expression profiles in this way yields 3070 unique binary 
states, where every state is a vector of 33 Boolean values corresponding to the 
activation/inactivation level of each of 33 genes in a given cell. In an asyn¬ 
chronous Boolean network, transitions correspond to the change of value of a 
single variable. Hence, we next look for pairs of states that differ by only one 
gene (that is, the Hamming distance between the two vectors is 1). An analysis 
of the strongly-connected components of this graph shows that one strongly con¬ 
nected component contains 44% of the states. We note that in a random sample 
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of 3934 elements from a space of 2 33 , the chance of seeing repeats or neighbours 
with Hamming-distance 1 is negligible. 

A plot of the graph of the largest strongly connected component is given in 
Figure[2j We add an edge for every Hamming-distance 1 pair and cluster together 
highly connected nodes. The colours of nodes correspond to the developmental 
time the measurements was taken. Note that there is a clear separation between 
the earliest developmental time point and the latest one. This representation 
already suggests a clear change of states over the development of the embryo, 
with separate clusters identifiable and obvious fate transitions between clusters. 

We wish to find an asynchronous Boolean network that matches this graph. 
For that we impose several restrictions on the Boolean network. Connections 
between states correspond to a change in the value of one gene, however, we 
do not know the direction of the change. Thus, we search simultaneously for 
directions and update functions of the different genes that satisfy the following 
two conditions: states from the earliest developmental time point should be able 
to evolve, through a series of single-gene transitions, to the states from the 
latest developmental time point. Secondly, the update functions must minimise 
the number of transitions that lead to additional, unobserved states, that were 
not measured in the experiment. 


3 Example: Reconstructing an ABN from its State Space 


We first illustrate our synthesis method 
using an example. We take an existing 
Boolean network, construct its associated 
state space, and then use this state space 
as input to our synthesis method in order 
to try to reconstruct the Boolean network 
that we started with. 

Krumsiek et. al. introduce a Boolean 
network model of the core regulatory net¬ 
work active in common myeloid progeni¬ 
tor cells 13]. Their network is based upon 
a comprehensive literature survey. It includes a set of 11 Boolean variables (corre¬ 
sponding to genes) and a Boolean update function for each variable (Figure [ 3 ]) j^] 
The model is given a well-defined initial starting state, representing the ex¬ 
pression profile of the common myeloid progenitor, and computational analysis 
reveals an acyclic, hierarchical state space of 214 states with four stable state 
attractors (Figure [4|. 

These stable attractors are in agreement with experimental expression pro¬ 
files of megakaryocytes, erythrocytes, granulocytes and monocytes; four of the 
mature myeloid cell types that develop from common myeloid progenitors. 

We treat the state space of this Boolean network as we would treat experi¬ 
mental data, forgetting all directionality information, and connecting all states 


Gene 

Update function 

Gata2 

Gata2f\ -<(Pu.l v ( Gatal A Fogl)) 

Gatal 

( Gatal v Gata2\/ Flil) A ->Pu.l 

Fogl 

Gatal 

EKLF 

Gatal A ->Flil 

Flil 

Gatal A -1 EKLF 

Scl 

Gatal A -> Pu. 1 

Cebpa 

Cebpa A ->{Scl V (Fogl A Gatal)) 

Pu.l 

( Cebpa V Pu.l) A -.( Gatal V Gata2) 

cjun 

Pu.l A -1 Gfil 

EgrNab 

( Pu. 1 A cJun) A -1 Gfil 

Gfil 

Cebpa A EgrNab 


Fig. 3. Boolean update functions 
for a manually curated network. 


The function of Cebpa is modified from that in 
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to match the format we assume. 
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Fig. 4. Boolean network state space. Initial Fig. 5. Close-up of Boolean network 
state is coloured green, stable states red. state space. 

which differ in the expression of only one gene by an undirected edge (Figures [4] 
and [5] where each edge is labelled with the single gene that changes in value 
between the states it connects). We would now like to reconstruct the Boolean 
network given in Figure [3] from this undirected state space. 

For each gene, we would like to assign a direction to each of its labelled edges 
(or decide that it does not exist), in a way that is compatible with a Boolean 
update function. For example, in Figure [5j we may orient the Pm. I -labelled edge 
between states 97 and 95 in the direction S97 -> sgs, in the direction S95 -*■ sgr, or 
decide that this is not a possible update. We also allow the edge to be directed 
in both directions. If S97 -» S95, we want a Boolean update function up u ,i that 
takes state S97 to state S95. Since there is no Pu.1 -labelled edge leaving state 
S150, we can also add the constraint that up u .i takes S150 to siso- 

We also add reachability constraints that restrict which edges are included 
and their orientation. Since the state space was constructed starting from a well- 
defined initial state, we would like to enforce the constraint that each non-initial 
state ought to be reachable by some directed path from the initial state. Since cell 
development proceeds hierarchically and unidirectionally, we favour short paths 
over long paths. This eliminates routes that seem biologically implausible, for 
example routes that cross a fate transition and then return to where they began. 

It also reduces the space of paths we have to search through. By increasing the 
lengths of allowed paths, we can increase the number of considered solutions. 

The results of applying our technique are shown in Figure [6] The method 
reconstructs the Boolean update functions for all but one gene ( EgrNab ), in some 
cases uniquely identifying the original function. We note that when multiple 
solutions are found for an update function, these solutions, while not exact, all 
provide useful regulatory information that could be verified experimentally. For 
example, both solutions for Scl successfully predict Sd' s activation by Gatal , 
although one of the two solutions omits its repression by Pu.l. 
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Synthesised update functions 

Gata2 A ->{Fogl v Pu.l) 

Gata2 A -<{Fogl v ( Pu.l A Cebpa)) 

Gata2A ->{Fogl v ( Pu.l A Gata2)) 

Gata2 A -■( Gata2 A (Pw. 1 v Fogl) 

Gata2 A ( P-u. Iv ( Gatal A Fogi ) ) 

Gata2 A ->{Pu.l v {Gata2 A Fogl)) 

{ Gatal v Cebpa) A ->Pu. 1 

{Gata2\J Fogl) A ->Pu.l 

(Gatal v Gata2) A -■ Pu.l 

( Gatal v Gata2v Fill) A ->Pu.l 

Other functions of the form (X V Y V Z) A ->Pu.l 

Gatal 

Gatal A -i Flil 
Gatal A ->EKLF 
Gatal 

Gatal A -i Pu.l 
Cebpa A ->{Fogl v SbZ) 

Cebpa A -<(Cebpa A (Sclv Fogl)) 

Cebpa A -<{Fogl A {Sclv Cebpa)) 

Cebpa A ->(Fogl v {ScIa Gatal)) 

Cebpa A -<{Fogl v {ScIa Gata2)) 

Cebpa A -i {Gatal A {Fogl V ScZ) 

Cebpa A ->{Scl\/ {Fogl A Cebpa) 

Cebpa A -i{Scl\/ {Fogl A Gatal) 

Pu. 1 A -i Gata2 

{Pu.l A Cebpa) A ->Gata2 

Pu. 1 A -<{ Gatal v Gata2) 

Other functions of the form Pu.l A ->{Gata2\/ X) 

Pu. 1 A ->{ Gata2 A Cepba) 

Pu. 1 A -■( Gata2 A Pu.l) 

Cebpa A -i {Gatal v Gata2) 

Cebpa A ->{Gata2\/ Fogl) 

(Cebpa v Pu.l) A ->{Gatal v Gata2) 

{Cebpa A Pu.l) A ->{Gatal v Gata2) 

Other functions of the form ( Cebpa V X) A ->{ Gata2 V X) 

Other functions of the form {Pu.l V X) A -<{Gata2\/ Y) 

Other functions of the form {Cebpa A Pu.l) A ->{Gata2\/ X) 

Pu. 1 a —i Gfil 

{cJun v Gatal) A -■ Gfil 

Cebpa A -lEgrNab 


Comments 


Unique 

Unique 

Unique 


Unique 

Incorrect with shortest paths 
Unique 


Fig. 6. Synthesised update functions. 

4 Background to Asynchronous Boolean Networks 

An asynchronous Boolean network (ABN) is B(V,U ), where V = {v\,V2, ■ ■ ■ ,v n } 
is a set of variables, and U = {iti, 112, ■ ■ ■, u n } is a set of Boolean update functions. 
For every u.i eU we have u.i : {0,1}” -*■ {0,1} associated with variable Vi. A state 
of the system is a map s : V -* {0,1). We say that an update function Ui is 
enabled at state s if Ui(s) + s(vi), i.e. applying the update function it,; to state 
s changes the value of variable iy. 

State s' = (d'i, d' 2r . ■ ■, d' n ) is a successor of state s = (e?i, ■ • • ,di,... ,d n ) 

if for some i we have itj is enabled, d[ = it,(s), and for all j * i we have d' = 
dj. That is, we get to the next state s', by non-deterministically selecting an 
enabled update function 14 and updating the value of the associated variable: 
s' = (di, d 2 , ■ ■ ■ ,Ui(di ),..., d n ). If no update function is enabled, the system 
remains in its current, stable, state, where it will remain: s' = s. 

An ABN induces a labelled transition system T = ( N,R ), where N is the 
set of 2 n states of the ABN, and R^NxVxN is the successor relation. Each 
transition (si,n,,S 2 ) is labelled with the variable Vi such that si(uj) + S 2 (vi). 
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The undirected state space of an ABN is an undirected graph S = ( N,E ), 
where each vertex n e N is uniquely labelled with a state s of the Boolean 
network, and there is an edge {si,S 2 } e E iff Sj and s 2 differ in the value 
of exactly one variable, v. The edge {si,S 2 } is labelled with v. In general, an 
undirected state space does not have to include all 2 n states induced by a Boolean 
network. 

An ABN B(V,U ) induces a directed state space on an undirected state space 
S = ( N,E ). Consider the transition system T = (2 V ,R) of B{U,V). Then, the 
induced directed state space is S' = (N,A), where (si, s 2 ) e A implies that there 
is a variable Vi such that (si,f j,S 2 ) 6 R- We say that (si,s 2 ) is compatible with 
Ui, if S 2 (vi) = u s (si), and for every j + i we have S 2 {vj) - S\(yj). 


5 Formal Definition of the Problem 

Our synthesis problem can be stated as follows: we are given an undirected state 
space S over a given set of variables V. We would like to extract a set of Boolean 
update functions that induce a directed state space from S such that each of 
the states in S are reachable from a given set of initial states. We also want to 
ensure that no additional, undesired states not in S are reachable, by ruling out 
transitions which ‘exit’ the state space. 

More formally, we are given a set of variables V = {v\,V 2 , ■ ■ ■, v n }, an undi¬ 
rected state space S = ( N,E ) over V , and a set I £ N of initial vertices. 

We would like to find an update function u t : {0,1}" -* {0,1} for each variable 
Vi € V, such that the following conditions hold. Let U = {«* | Vi e V} be the set 
of update functions. 

1. Every non-initial vertex s € N -1 is reachable from some initial vertex Si e / 
by a directed path in the directed state space induced by B(V, U ) on S. 

2. For every variable v r e V, let N,; be the set of states without an outgoing 
^-labelled arc. For every i we require that for each s e TV,;, iq(s) = s(vi). 


5.1 Generalising the Definition to Partial Data 

Since we intend to apply our method in an experimental setting, where we only 
have an incomplete sample from the possible states of the system, we relax 
this definition to extend it to partial data. Instead of requiring that every state 
is reachable from those initial states that we have measured, we only require 
that a set of final states are reachable. Instead of requiring that every undesired 
transition is ruled out, we seek to maximise the number of such transitions which 
are eliminated. This is formally stated next. 

As before, we are given a set of variables V = {vi,V 2 , • ■ •, v n }, an undirected 
state space S = ( N,E ) over V, and a designated set I £ N of initial vertices. 
In addition, we are given a designated set F £ N of final vertices, along with 
a threshold U for each variable i >* 6 V. The threshold t,; specifies how many 
undesired transitions must be ruled out. 



We would like to find an update function Ui : {0,1}" -*■ {0,1} for each variable 
Vi e V, such that the following conditions hold. Let U = {ui \ Vi e V} be the set 
of update functions. 

1. Every final vertex Sf e F is reachable from some initial vertex s, € I by a 
directed path in the directed state space induced by B(V, U) on S. 

2. For every variable V{ e V, let N t be the set of states without an outgoing re¬ 
labelled arc. For every i the number of states s e N t such that ui(s) = s(ie) 
is greater or equal to t t . 

In the remainder of the text, we refer to condition 1 as the reachability con¬ 
dition and condition 2 as the threshold condition. 

We restrict the search to update functions of the form /i A -i/ 2 , where fi 
is a monotone Boolean formula. The inputs to /i are the activating inputs to 
the gene and the inputs to / 2 are the the repressing inputs. This restriction was 
chosen after discussion with biologist colleagues and consultation of the literature 
(e.g., [2|[l3]). 

6 A Direct Encoding 

We start with a direct encoding of the search for a matching Boolean network. 
The search is parameterised by the shape of update functions (how many activa¬ 
tors and how many repressors each variable has), the length of paths from initial 
states to final states, and the thresholds for each variable. By increasing the 
first two parameters and decreasing the last we can explore all possible Boolean 
networks. 


6.1 Possible Update Functions 

In order to represent the Boolean update function for gene Vi, Ui = f\ a -./ 2 , 
we use a bitvector encoding. We represent the Boolean formula fj by a set of 
bitvectors, {ai,a 2 ,... a„}, ay € V u {v, a}, where each bitvector at represents a 
variable or a Boolean operator, and solutions take the form of a binary tree. For 
example, the formula v\ A (i ; 2 v« 3 ) is represented by the solution ai = A,a 2 = 
v,a 3 = Vi , a 4 = u 2 ,a 5 = u 3 . We restrict the syntactic form of possible update 
functions so that each variable appears only once, and each possible function 
has one canonical representation. For example, the function (zq A (z > 2 v t; 3 )) is 
included in our search space while (zq ao 2 )v (zq A V 3 ) is not. We search for 
functions up to a maximum number of activators, A t . and a maximum number 
of repressors, Ri. 

To encode the application of function u t to a state s, Ui(s ), we add impli¬ 
cations which unwrap the bitvector encoding of Ui to the constituent variables 
and logical operators; substituting values, s(zy), for variables, Vj, and directly 
mapping operations to logical constraints in the Boolean satisfiability formula. 
For example, the application of the function (iq v» 2 )a -.zj 3 to the state si is 
mapped to (si(z'i) v si(u 2 )) A --Si(u 3 ). 
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6.2 Ensuring Reachability 


To enforce the global reachability condition we consider all of the underlying 
directed edges in the undirected state space S = ( N,E ), and their associated 
single-gene transitions. 

Recall that we require every final vertex to be reachable from some initial 
vertex by a directed path in the directed state space induced on S by the Boolean 
network. That is, we require that every final vertex is reachable by a directed 
path, and that every Uj-labelled edge along this path is compatible with its 
associated update function, Uj. 

To enforce this we add constraints that track the compatibility of edges with 
update functions and define reachability recursively. We consider reachability by 
paths up to a maximum length: recall that we consider shorter paths to be more 
biologically likely. By iteratively increasing the length of the paths considered, 
we can obtain all satisfying models. 

We introduce a pair of Boolean variables , e ? , for each i^-labelled undi¬ 
rected edge {si,sj} e E, which track the value of the application of u t to s,; and 
to Sj (and the compatibility of the underlying directed edges (si,Sj) and ( Sj,Si ) 
with Ui). eij is true iff Ui(si) = Sj(v). 

We introduce an integer given by a bitvector encoding, r„, for each node 
n e N. Bitvector r n encodes the fact that node n is reachable from an initial 
node in r n steps, up to some maximum encodable value 2^" - 1. Bitvector r n is 
given a value of -1 to indicate that n is not reachable in this maximum number 
of steps. 

Reachability is then defined inductively: 

1. Initial nodes are reachable in zero steps: for every i e /, r,; = 0. 

2. A non-initial node s l is reachable in M steps if there is a compatible incoming 
edge ( Sj , Si) from another node Sj, and Sj is itself reachable in fewer than M 
steps. That is, for every n = Sj € TV - I and to = s,; € N such that {s*, Sj} e E 
we have e ij -* r m < r n . We also have that non-initial nodes cannot be reached 
in zero steps: For every n e N - I, r n = -1 v r n > 0. 

Finally, we add a constraint that every final node n € F is reachable from 
some initial node: r n + -1. 


6.3 Enforcing the Threshold Condition 

We enforce the threshold condition for each update function as follows. 

Consider an update function tq : V -> {0,1). We say that a node s e is 
negatively matched by iq if Ui{s) - s(vi). That is, by using ut as the update 
function of variable iq, Ui does not change the value of i\ from node s. We are 
searching for an update function such that a maximum number of nodes from 
Ni are negatively matched. 

We add a variable, ?n,; s for each node s € Ni to record whether Ui negatively 
matches s. We then add a constraint demanding that the number of negatively 
matched nodes is greater than or equal to the threshold: Y,szNi m is > U . 
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We search for satisfying assignments to the constraint variables encoding the 
representation of the Boolean update functions w, for all 'ty in V. The resulting 
synthesised Boolean network is the combination of these update functions. 

Unfortunately, in practice the direct encoding of the search does not scale to 
handle our experimental data. In the next section we suggest a compositional 
way to solve the problem. 


7 A Compositional Algorithm 

We now introduce our compositional algorithm, which scales better than the 
direct encoding given above. The problem of synthesising a Boolean network 
from the data is partitioned to three stages. Crucially, we avoid searching for 
a complete Boolean network and consider parts of the network that can be 
constructed independently. 


7.1 Pruning the Set of Possible Edges 


We start by building a directed graph from the given undirected state space 
S = ( N,E ), by considering which of the underlying directed edges in E are 
compatible with some Boolean update function, and pruning those that are not. 
We consider each underlying directed edge (si,S 2 ) and (s 2 ,si) of each of the 
ty-labelled undirected edges {si,S 2 } in E independently. 

We pose a decision problem for each directed edge (si,S 2 ): whether there 
exists some Boolean update function m satisfying the threshold condition (con¬ 
dition 2, 5.11 such that Ui(si) = s- 2 (vi). This is encoded as a Boolean satisfiability 
problem, adding constraints to represent the encoding of the update function, 
the threshold condition, and the evaluation of the function at the specific edge 
under consideration. We say that a satisfying function, u t . is compatible with 
(si,S 2 ). Once a compatible function has been found, it can quickly be evaluated 
outside the solver at other edges to try reduce the number of SAT queries we 
have to make. 

After making a query for each edge, we are left with a directed graph, which 
is the existential projection of all compatible update functions for each of the 
variables v e V. We have eliminated edges which have no compatible update 
function, and cannot participate in the reachability condition. On the example 
data set from Section [sj this step removes 18% of the possible edges. 


7.2 Ensuring Reachability 

We now come to the only part of the algorithm that considers the edges of all 
variables together, in order to enforce the global reachability condition (condi¬ 
tion 1, 5.1). This phase does not require the solving of a Boolean satisfiability 


problem, and as a result is very efficient. 

We construct, for each pair of initial nodes it / and final nodes /eF, the 
shortest path pif from i to f in the directed graph that was built in the previous 
phase of the algorithm. These paths can be computed via a breadth-first search. 
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Due to the edge pruning of the previous phase of the algorithm, if there is 
no path to a final node /, this implies that there are no satisfying models (at 
the given threshold and function size parameters). Otherwise, our reachability 
condition will be enforced by fixing a set of directed edges Pi for each variable 
Vi e V corresponding to these shortest paths. We will then require that the 
update function we search for, u t , is compatible with each of the edges in Pi. 

We choose, for each final node /, one path pf = pij from one of the initial 
nodes i. By fixing this path, we ensure that / is reachable from an initial node. 
We define pf\i as the set of relabelled edges in the path pf. We define P i: the 
relabelled edges which must be fixed to ensure reachability via the chosen paths, 
as the the set of relabelled edges in pf for each final node /: 

Pi = U {(si, s 2 ) | (s 1 ,s 2 ) € P f\i} (1) 

PF 

By considering only the edges in P*, we can search for an update function 
for Vi independently of all other variables, while ensuring the global reachability 
condition holds. 


7.3 Final Update Functions 


We can now search for the update function of variable ry, rq, independently of 
all other variables. We fix the Uj-labclled edges computed in the previous phase 
and encode the search for rq as a Boolean satisfiability problem. 

As before we add constraints to encode the representation of Uj, and to 
enforce the threshold condition. We fix each of the relabelled edges (si, s 2 ) e P t 
to establish reachability, by adding a conjunction requiring that rq is compatible 
with each of them: rq(si) = s 2 (vi). 

We search for satisfying assignments of the constraint variables encoding rtj, 
using an ALLSAT procedure to extract all possible update functions for variable 
Vi. This gives rise to a set of update functions per variable and a set of Boolean 
networks from the product of the set of update functions per variable. 

We note that this final phase of the algorithm can fail to find update functions 
for a variable Uj, because there are no possible update functions compatible with 
all of the path edges Pi that were computed in the previous phase. That is, while 
each edge in P, is individually compatible with some update function, there may 
be no update function that is compatible with every edge in Pi. In order to 
cope with this limitation, we can extract the minimal unsatisifiable core of the 
Boolean formula, and search for replacement paths that exclude incompatible 
combinations of edges. This step can be iterated until satisfying solutions are 
found for all variables, or until no path can be found, implying that there are 
no valid models. 

By extending our search from the shortest paths between initial and final 
node pairs in the directed graph to the fc-shortest paths between pairs and in- 


cremementally increasing k 26 , we can increase the number of possible update 


functions that we consider. In the limit, we will obtain all satisfying models. 
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Data set 

Genes 

States 

Direct (seconds) 

Compositional (seconds) 

CMP (synthetic) 

11 

214 

25 

77 

Blood stem cells 

21 

753 

Out of Memory 

5114 

Embryonic (66% of states) 

33 

956 

Out of Memory 

3364 

Embryonic (full) 

33 

1448 

Out of Memory 

8709 


Fig. 7. Performance of direct encoding and compositional algorithm on example data 
sets. 

An implementation of our algorithm, which is written in and uses Z3 
as the satisfiability solver, is available at https: //github. com/swoodhouse/ 
SCNS-Toolkit. In Figure [7] we present experimental results from running our 
implementation of the direct encoding from Section [6] and compositional algo¬ 
rithm on four data sets: the small synthetic data set from Section [3j the large 
embryonic experimental data set from Section[2j and a second experimental data 
set covering blood stem cells. We also show results from rerunning on the em¬ 
bryonic data set with a third of states removed. All experiments were performed 
on an Intel Core i5 @ 1.70GHz with 8GB of RAM, using a single thread. 

While the direct encoding synthesised a matching Boolean network on the 
small synthetic data set faster than our compositional algorithm, it cannot scale 
to the real experimental data sets, quickly running out of memory. The composi¬ 
tional algorithm, on the other hand, can scale to handle real data sets of the sort 
produced by our experimental collaborators. All experiments terminated within 
a few hours, when running on a single thread. The compositional algorithm can 
easily be parallelised over variables, which would further increase its efficiency. 


8 Application to the Experimental Dataset 

We now return to the experimental data set introduced in Section [2] 

Recall that cell measurements were taken from four sequential developmental 
time points, and that the state graph resulting from discretisation of the data 
(Figure [2]) exhibited a clear separation between the earliest developmental time 
point (states coloured blue) and the latest (states coloured red). We applied our 
synthesis technique to this data, taking the initial states to be the states from 
the first time point, and the final states to be the states from the latest time 
point. For complete details, we direct the reader to 
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The result of the synthesis was a set of possible Boolean update functions 
for each of the 33 genes, with several genes having a uniquely identified update 
function. By applying standard techniques for the analysis of Boolean networks, 
we found the stable state attractors and performed computational perturbations. 
The synthesised network, along with the subsequent computational analysis led 
to a set of predictions which were then tested experimentally. We found that 
our results were robust when performing bootstrapping, removing a third of the 
data at random and rerunning the synthesis algorithm. 

Our experimental collaborators were able to validate key predictions made by 
our analysis. The update function for one of the genes at the core of this network, 
Erg , which directly activates many other genes, was tested experimentally by a 
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range of assays. Evidence was found that the activators specified in the gene’s 
synthesised update function ( Hoxb4 and Soxl7) do indeed activate expression of 
the gene, and furthermore in a fashion consistent with the Boolean “OR” logic of 
the synthesised update function. This could be regarded as a “local” validation 
of our model, testing two of the directed edges in the network. 

Computational perturbations to another gene at the core of the network, 
Sox7, indicated that when Sox7 was forced to always be expressed, stable states 
corresponding to cells from the final developmental time point (blood progeni¬ 
tors) no longer exist. Cell differentiation assays confirmed this prediction exper¬ 
imentally, finding that when this gene was forced to be expressed, the number 
of cells which normally emerge at this final time point is significantly reduced. 
This can be thought of as a “global” validation of our model, as it is a prediction 
about the behaviour of the whole network under a certain perturbation. 


9 Related Work 


Previous analyses of single-cell gene expression data have mostly been based 
on statistical properties of the data viewed as a whole, such as the correlation 
in the level of expression of pairs of genes [8 ( 15 . Such analysis cannot recover 
mechanistic Boolean logic, does not infer the direction of interactions and cannot 
easily distinguish direct from indirect influence. 

Boolean networks were introduced by Kauffman in order to study random 
models of genetic regulatory networks 10 . They have since been applied in 


a range of contexts, from modelling blood stem and progenitor differentiation 
013 , to the yeast apoptosis network [TT] , to the network regulating pluripotency 
in embryonic stem cells [9]. BDD-based algorithms for state-space exploration 
and finding attractors of Boolean networks have been introduced (7 27 . 

Synthesis is the problem of producing programs or designs from their speci¬ 
fications. In recent years much progress has been made on the usage of SAT and 
SMT solvers for synthesis. Essentially, the existence of a program that solves a 
certain problem is posed as a satisfiability query. Then, a solver tries to search 
for a solution to the query, which corresponds to a program. For example, Srivas- 
tava et. al. 22 23 show that the capabilities of SMT solvers to solve quantified 


queries enable the search for conditions and code fragments that match a given 
specification. Similarly, Solar-Lezama et. al. 21 build a framework for writing 
programs with “holes” and letting a search algorithm find proper implementa¬ 
tions for them. The approach of reactive synthesis 19 is similar to ours in the 


type of artefact that it produces. However, the techniques that we are using are 
more related to those explained above. Recently, Beyene et. al. l] have shown 
how constraint solving can be used also in the context of reactive synthesis. 

Synthesis has recently been applied in the context of biology. Koksal et. al. show 
how to synthesise state-machine-like models from gene mutation experiments us¬ 


ing a novel counterexample-guided inductive synthesis (CEGIS) algorithm 12 


Their approach uses constraint solvers to search for program completions that 
match given specifications, as explained above. Both the data and the type of 
model are different to those dealt with here, which called for a new approach. 
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Recently, there have been several applications of synthesis to Boolean net¬ 
works. Dunn et. 


al. 


and Xu et. al. 25 show how to fit an existing static, 


topological regulatory network for embryonic stem cells to gene expression data 
in order to obtain an executable Boolean network, under the assumption that 
experimentally measured data represent stable states of the system. This as¬ 
sumption may be appropriate for cell lines maintained in culture, but it does 
not adapt well to developmental processes such as ours, where cells are transit¬ 
ing through intermediate states in order to develop into a particular lineage. 

Recent work of Karp and Sharan 20 shows how to synthesise Boolean net¬ 
works given a topological network and a set of perturbation experiments, by 
reduction to integer linear programming. In 17 , Paoletti et. al. synthesise a 


related class of models (which incorporate timing and spatial information) from 
perturbation data, via reducion to SMT. To the best of our knowledge, our ap¬ 
proach is the first to synthesise gene regulatory network models directly from 
raw gene expression data, without the need of either genetic perturbation data 
or a-priori information about the topology of the network. 


10 Conclusions and Future Work 

We presented a technique for synthesising Boolean networks from single-cell 
resolution gene-expression data. This new and exciting type of data allows us to 
consider the state of each cell separately, giving rise to “state snapshots”, which 
we treat as the states of an asynchronous Boolean network. Our key insight is 
that the update functions of each variable can be sought after separately, giving 
rise to reasonably sized satisfiability queries. We then combine the single gene 
update functions by considering the flow of time included in the data. 

We are able to reconstruct rules from a manually curated Boolean network 
and produce a set of possible Boolean networks for the given experimental data, 
for which no similar curated Boolean network is available. The discussion with 
biologists about this Boolean network led to a set of predictions, which were 
then experimentally validated in the lab. 

We are awaiting similar data from additional experiments to apply the same 
technique to. At the same time, we are considering the usage of advanced 
search techniques, as used in this paper, to the analysis of other types of high- 
throughput data. Future work in the experimental domain includes the validation 
of more of the links in our synthesised network, and the design of further gene 
perturbation experiments motivated by the results of computational perturba¬ 
tions. An interesting question for future research is whether techniques like ours, 
which achieve scalability by treating different aspects of a graph data structure 
seperately, are applicable to other domains where graph-like data is generated. 
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